
    ProbCD: enrichment analysis accounting for categorization uncertainty

    As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high-throughput datasets, current enrichment methods largely ignore this probabilistic information, since they are mainly based on variants of Fisher's exact test. We developed ProbCD, an open-source R package for probabilistic categorical data analysis that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli scheme stochastic process given the categorization probabilities. An on-line interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiology.net/ProbCD/. We present an analysis framework and software tools to address the issue of uncertainty in categorical data analysis. In particular, for enrichment analysis, ProbCD can accommodate (i) the stochastic nature of high-throughput experimental techniques and (ii) probabilistic gene annotation.
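    The expected contingency table described above can be sketched as follows. This is an illustrative reconstruction under assumed inputs, not the ProbCD implementation; the variable names `p` and `q` are hypothetical.

```python
# Sketch (not ProbCD itself): the expectation of a 2x2 contingency table
# under independent Bernoulli draws, given categorization probabilities.
# p[i]: assumed probability that gene i carries the annotation
# q[i]: assumed probability that gene i belongs to the study set

def expected_contingency(p, q):
    """Expected 2x2 table [[in-cat & in-set, in-cat & out], [out & in-set, out & out]]."""
    n11 = sum(pi * qi for pi, qi in zip(p, q))
    n10 = sum(pi * (1 - qi) for pi, qi in zip(p, q))
    n01 = sum((1 - pi) * qi for pi, qi in zip(p, q))
    n00 = sum((1 - pi) * (1 - qi) for pi, qi in zip(p, q))
    return [[n11, n10], [n01, n00]]

# With certain (0/1) probabilities this reduces to an ordinary count table.
table = expected_contingency([0.9, 0.2, 0.7], [1.0, 0.5, 0.0])
```

    A conventional association test could then be applied to this expected table; note that the entries always sum to the number of genes, regardless of how uncertain the individual memberships are.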

    Glutamate 301 of the mouse gonadotropin-releasing hormone receptor confers specificity for arginine 8 of mammalian gonadotropin-releasing hormone

    The Arg residue at position 8 of mammalian GnRH is necessary for high affinity binding to mammalian GnRH receptors. This requirement has been postulated to derive from an electrostatic interaction of Arg8 with a negatively charged receptor residue. In order to identify such a residue, eight conserved acidic residues of the mouse GnRH receptor were mutated to isosteric Asn or Gln. Mutant receptors were tested for decreased preference for Arg8-containing ligands by ligand binding and inositol phosphate production. One of the mutants, in which the Glu301 residue was mutated to Gln, exhibited a 56-fold decrease in apparent affinity for mammalian GnRH. The mutant receptor also exhibited decreased affinity for [Lys8]GnRH, but its affinity for [Gln8]GnRH was unchanged compared with the wild type receptor. The apparent affinity of the mutant receptor for the acidic analogue, [Glu8]GnRH, was increased more than 10-fold. The mutant receptor did not, therefore, distinguish mammalian GnRH from analogues with amino acid substitutions at position 8 as effectively as the wild type receptor. This loss of discrimination was specific for the residue at position 8, because the mutant receptor did distinguish mammalian GnRH from analogues with favorable substitutions at positions 5, 6, and 7. These findings show that Glu301 of the GnRH receptor plays a role in receptor recognition of Arg8 in the ligand and are consistent with an electrostatic interaction between these two residues.

    Proteome Profiling of Breast Tumors by Gel Electrophoresis and Nanoscale Electrospray Ionization Mass Spectrometry

    We have conducted proteome-wide analysis of fresh surgery specimens derived from breast cancer patients, using an approach that integrates size-based intact protein fractionation, nanoscale liquid separation of peptides, electrospray ion trap mass spectrometry, and bioinformatics. Through this approach, we have acquired a large amount of peptide fragmentation spectra from size-resolved fractions of the proteomes of several breast tumors, tissue peripheral to the tumor, and samples from patients undergoing noncancer surgery. Label-free quantitation was used to generate protein abundance maps for each proteome and perform comparative analyses. The mass spectrometry data revealed distinct qualitative and quantitative patterns distinguishing the tumors from healthy tissue as well as differences between metastatic and non-metastatic human breast cancers including many established and potential novel candidate protein biomarkers. Selected proteins were evaluated by Western blotting using tumors grouped according to histological grade, size, and receptor expression but differing in nodal status. Immunohistochemical analysis of a wide panel of breast tumors was conducted to assess expression in different types of breast cancers and the cellular distribution of the candidate proteins. These experiments provided further insights and an independent validation of the data obtained by mass spectrometry and revealed the potential of this approach for establishing multimodal markers for early metastasis, therapy outcomes, prognosis, and diagnosis in the future. © 2008 American Chemical Society

    GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists

    Background: Since the inception of the GO annotation project, a variety of tools have been developed that support exploring and searching the GO database. In particular, a variety of tools that perform GO enrichment analysis are currently available. Most of these tools require as input a target set of genes and a background set, and seek enrichment in the target set compared to the background set. A few tools also exist that support analyzing ranked lists; these typically rely on simulations or on union-bound correction for assigning statistical significance to the results. Results: GOrilla is a web-based application that identifies enriched GO terms in ranked lists of genes, without requiring the user to provide explicit target and background sets. This is particularly useful in many typical cases where genomic data may be naturally represented as a ranked list of genes (e.g. by level of expression or of differential expression). GOrilla employs a flexible-threshold statistical approach to discover GO terms that are significantly enriched at the top of a ranked gene list. Building on a complete theoretical characterization of the underlying distribution, called mHG, GOrilla computes an exact p-value for the observed enrichment, taking threshold multiple testing into account without the need for simulations. This enables rigorous statistical analysis of thousands of genes and thousands of GO terms in a matter of seconds. The output of the enrichment analysis is visualized as a hierarchical structure, providing a clear view of the relations between enriched GO terms. Conclusion: GOrilla is an efficient GO analysis tool with unique features that make it a useful addition to the existing repertoire of GO enrichment tools. GOrilla's unique features and advantages over other threshold-free enrichment tools include rigorous statistics, fast running time and an effective graphical representation. GOrilla is publicly available at: http://cbl-gorilla.cs.technion.ac.il
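    The mHG statistic mentioned above can be sketched in a few lines: it is the minimum hypergeometric tail probability over every cutoff at the top of the ranked list. This is an illustrative sketch of the score only; the exact, multiple-testing-corrected p-value that GOrilla reports on top of this score is not reproduced here, and the function names are assumptions.

```python
from math import comb

def hg_tail(b, N, B, n):
    """P(X >= b) for X ~ Hypergeometric(N total, B marked, n drawn)."""
    # math.comb returns 0 when the second argument exceeds the first,
    # so out-of-range terms vanish automatically.
    return sum(comb(B, k) * comb(N - B, n - k)
               for k in range(b, min(B, n) + 1)) / comb(N, n)

def mhg_score(labels):
    """labels: ranked 0/1 list, 1 = gene carries the GO term being tested."""
    N, B = len(labels), sum(labels)
    best, b = 1.0, 0
    for n in range(1, N):        # scan every top-of-list cutoff
        b += labels[n - 1]       # marked genes seen among the top n
        best = min(best, hg_tail(b, N, B, n))
    return best
```

    A list whose marked genes all sit at the top (e.g. `[1, 1, 1, 0, 0, 0]`) yields a small mHG score, while the reversed list yields a score near 1, which is exactly the enrichment-at-the-top behaviour the tool exploits.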

    The acceleration of the universe and the physics behind it

    Using a general classification of dark energy models into four classes, we discuss the complementarity of cosmological observations in pinning down the physics behind the acceleration of our universe. We discuss the tests distinguishing the four classes and then focus on the dynamics of the perturbations in the Newtonian regime. We also explicitly exhibit models that have identical predictions for a subset of observations.

    Planck 2013 results. XXII. Constraints on inflation

    We analyse the implications of the Planck data for cosmic inflation. The Planck nominal mission temperature anisotropy measurements, combined with the WMAP large-angle polarization, constrain the scalar spectral index to be ns = 0.9603 ± 0.0073, ruling out exact scale invariance at over 5σ. Planck establishes an upper bound on the tensor-to-scalar ratio of r < 0.11 (95% CL). The Planck data thus shrink the space of allowed standard inflationary models, preferring potentials with V'' < 0. Exponential potential models, the simplest hybrid inflationary models, and monomial potential models of degree n ≥ 2 do not provide a good fit to the data. Planck does not find statistically significant running of the scalar spectral index, obtaining dns/dln k = −0.0134 ± 0.0090. We verify these conclusions through a numerical analysis, which makes no slow-roll approximation, and carry out a Bayesian parameter estimation and model-selection analysis for a number of inflationary models including monomial, natural, and hilltop potentials. For each model, we present the Planck constraints on the parameters of the potential and explore several possibilities for the post-inflationary entropy generation epoch, thus obtaining nontrivial data-driven constraints. We also present a direct reconstruction of the observable range of the inflaton potential. Unless a quartic term is allowed in the potential, we find results consistent with second-order slow-roll predictions. We also investigate whether the primordial power spectrum contains any features. We find that models with a parameterized oscillatory feature improve the fit by Δχ²_eff ≈ 10; however, Bayesian evidence does not prefer these models. We constrain several single-field inflation models with generalized Lagrangians by combining power spectrum data with Planck bounds on fNL. Planck constrains with unprecedented accuracy the amplitude and possible correlation (with the adiabatic mode) of non-decaying isocurvature fluctuations. The fractional primordial contributions of cold dark matter (CDM) isocurvature modes of the types expected in the curvaton and axion scenarios have upper bounds of 0.25% and 3.9% (95% CL), respectively. In models with arbitrarily correlated CDM or neutrino isocurvature modes, an anticorrelated isocurvature component can improve the χ²_eff by approximately 4 as a result of slightly lowering the theoretical prediction for the ℓ ≲ 40 multipoles relative to the higher multipoles. Nonetheless, the data are consistent with adiabatic initial conditions.
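    The spectral-index and tensor-to-scalar bounds quoted above are conventionally tied to the shape of the inflaton potential through the first-order slow-roll relations (standard textbook expressions, not quoted from the paper itself):

```latex
% First-order slow-roll relations between the observables and the
% potential slow-roll parameters \epsilon_V and \eta_V:
n_s - 1 \simeq -6\epsilon_V + 2\eta_V, \qquad r \simeq 16\,\epsilon_V,
\qquad
\epsilon_V = \frac{M_{\mathrm{Pl}}^2}{2}\left(\frac{V'}{V}\right)^2,
\qquad
\eta_V = M_{\mathrm{Pl}}^2\,\frac{V''}{V}.
```

    In this language, the bound r < 0.11 translates into ε_V ≲ 0.007, and the stated preference for V'' < 0 is a preference for negative η_V.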

    Increasing consistency of disease biomarker prediction across datasets

    Microarray studies with human subjects often have limited sample sizes, which hampers the detection of reliable biomarkers associated with disease and motivates the need to aggregate data across studies. However, human gene expression measurements may be influenced by many non-random factors such as genetics, sample preparation, and tissue heterogeneity. These factors can contribute to a lack of agreement among related studies, limiting the utility of their aggregation. We show that it is feasible to carry out an automatic correction of individual datasets to reduce the effect of such 'latent variables' (without prior knowledge of the variables) in such a way that datasets addressing the same condition show better agreement once each is corrected. We build our approach on the method of surrogate variable analysis (SVA), but we demonstrate that the original algorithm is unsuitable for the analysis of human tissue samples that are mixtures of different cell types. We propose a modification to SVA that is crucial to obtaining the improvement in agreement that we observe. We develop our method on a compendium of multiple sclerosis data and verify it on an independent compendium of Parkinson's disease datasets. In both cases, we show that our method is able to improve agreement across varying study designs, platforms, and tissues. This approach has the potential for wide applicability to any field where lack of inter-study agreement has been a concern. © 2014 Chikina, Sealfon

    Misty Mountain clustering: application to fast unsupervised flow cytometry gating

    Background: There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets such as flow cytometry data, prove unsatisfactory in terms of speed, susceptibility to local minima, and cluster-shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user-defined interval; the final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming, and different criteria frequently yield different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of 10^6 points that are often generated by high-throughput experiments. Results: To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large datasets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map, identifying clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing the cross sections of the histogram only once. The overall run time for the composite steps of the algorithm increases linearly with the number of data points. The clustering of 10^6 data points in a 2D data space takes about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state-of-the-art automated flow cytometry gating methods indicates that Misty Mountain provides substantial improvements in both run time and the accuracy of cluster assignment. Conclusions: Misty Mountain is fast, unbiased with respect to cluster shape, identifies stable clusters and is robust to noise. It provides a useful, general solution for multidimensional clustering problems. We demonstrate its suitability for automated gating of flow cytometry data.
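    The "cross section" idea can be illustrated with a minimal sketch: bin the data into a 2D histogram and group above-threshold bins into connected components, which play the role of the statistically distinct peaks at one density level. This is an illustrative simplification under assumed inputs, not the Misty Mountain implementation (which scans levels progressively and applies statistical tests).

```python
# Sketch: one density cross section of a 2D histogram.
# Bins with counts >= threshold are grouped into 4-connected components.
from collections import Counter, deque

def histogram_2d(points, bin_size):
    """Count points per (i, j) histogram bin."""
    return Counter((int(x // bin_size), int(y // bin_size)) for x, y in points)

def cross_section_clusters(counts, threshold):
    """Connected components (4-neighbour) of bins with count >= threshold."""
    above = {b for b, c in counts.items() if c >= threshold}
    clusters, seen = [], set()
    for start in above:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:                      # flood fill one component
            i, j = queue.popleft()
            if (i, j) in seen or (i, j) not in above:
                continue
            seen.add((i, j))
            comp.add((i, j))
            queue.extend([(i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)])
        clusters.append(comp)
    return clusters
```

    Two well-separated blobs of points produce two components at a suitable threshold; the full algorithm repeats this across descending density levels and tracks how components appear and merge.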